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Abstract 

Modeling and predicting co-occurrences of events is a fundamental problem of unsupervised learning. In 
this contribution we develop a statistical framework for analyzing co-occurrence data in a general setting 
where elementary observations are joint occurrences of pairs of abstract objects from two finite sets. The 
main challenge for statistical models in this context is to overcome the inherent data sparseness and to 
estimate the probabilities for pairs which were rarely observed or even unobserved in a given sample 
set. Moreover, it is often of considerable interest to extract grouping structure or to find a hierarchical 
data organization. A novel family of mixture models is proposed which explain the observed data by a 
finite number of shared aspects or clusters. This provides a common framework for statistical inference 
and structure discovery and also includes several recently proposed models as special cases. Adopting 
the maximum likelihood principle, EM algorithms are derived to fit the model parameters. We develop 
improved versions of EM which largely avoid overhtting problems and overcome the inherent locality of 
EM-based optimization. Among the broad variety of possible applications, e.g., in information retrieval, 
natural language processing, data mining, and computer vision, we have chosen document retrieval, the 
statistical analysis of noun/adjective co-occurrence and the unsupervised segmentation of textured images 
to test and evaluate the proposed algorithms. 
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1 Introduction 

The ultimate goal of statistical modeling is to explain 
observed data with a probabilistic model. In order to 
serve as a useful explanation, the model should reduce 
the complexity of the raw data and has to offer a certain 
degree of simplification. In this sense statistical modeling 
is related to the information theoretic concept of min¬ 
imum description length [47, 48]. A model is a ‘good’ 
explanation for the given data if encoding the model 
and describing the data conditioned on that model yields 
a significant reduction in encoding complexity as com¬ 
pared to a ‘direct’ encoding of the data. 

Complexity considerations are in particular relevant 
for the type of data investigated in this paper which is 
best described by the term co-occurrence data (COD) 
[30, 9]. The general setting is as follows: Suppose two 
finite sets X = {xi, . . ., x jy} and y = {t/i, . . ., j/m} of 
abstract objects with arbitrary labeling are given. As el¬ 
ementary observations we consider pairs (xi, yjj £ X xy, 
i.e., a joint occurrence of object Xi with object yj. All 
data is numbered and collected in a sample set S = 
{( x i(r), yj(r), r ) '■ l < r < L} with arbitrary ordering. 
The information in S is completely characterized by its 
sufficient statistics riij = \{(xi, yj, r) £ i5}| which mea¬ 
sure the frequency of co-occurrence of Xi and yj . An im¬ 
portant special case of COD are histogram data, where 
each object Xi £ X is characterized by a distribution 
of measured features yj £ y. This becomes obvious 
by partitioning S into subsets Si according to the X- 
component, then the sample set Si represent an empir¬ 
ical distribution rijp over y, where rijp = riij/rii and 
rii = |S;|. 

Co-occurrence data is found in many distinctive ap¬ 
plication. An important example is information retrieval 
where X may correspond to a collection of documents 
and y to a set of keywords. Hence riij denotes the num¬ 
ber of occurrences of a word yj in the (abstract of) doc¬ 
ument Xi. Or consider an application in computational 
linguistics, where the two sets correspond to words be¬ 
ing part of a binary syntactic structure such as verbs 
with direct objects or nouns with corresponding adjec¬ 
tives [22, 43, 10]. In computer vision, X may correspond 
to image locations and y to (discretized or categorical) 
feature values. The local histograms rijp in an image 
neighborhood around Xi can then be utilized for a sub¬ 
sequent image segmentation [24]. Many more examples 
from data mining, molecular biology, preference analysis, 
etc. could be enumerated here to stress that analyzing 
co-occurrences of events is in fact a very general and 
fundamental problem of unsupervised learning. 

In this contribution a general statistical framework 
for COD is presented. At a first glance, it may seem 
that statistical models for COD are trivial. As a conse¬ 
quence of the arbitrariness of the object labeling, both 
sets only have a purely nominal scale without ordering 
properties, and the frequencies riij capture all we know 
about the data. However, the intrinsic problem of COD 
is that of data sparseness, also known as the zero fre¬ 
quency problem [19, 18, 30, 64]. When N and M are very 
large, a majority of pairs (aq, yj) only have a small prob¬ 
ability of occurring together in S. Most of the counts 


riij will thus typically be zero or at least significantly 
corrupted by sampling noise, an effect which is largely 
independent of the underlying probability distribution. 
If the normalized frequencies are used in predicting fu¬ 
ture events, a large number of co-occurrences is observed 
which are judged to be impossible based on the data S. 
The sparseness problem becomes still more urgent in the 
case of higher order COD where triplets, quadruples, etc 
are observed. 1 Even in the domain of natural language 
processing where large text corpora are available, one has 
rarely enough data to completely avoid this problem. 

Typical state-of-the-art techniques in natural lan¬ 
guage processing apply smoothing techniques to deal 
with zero frequencies of unobserved events. Prominent 
techniques are, for example, the back-off method [30] 
which makes use of simpler lower order models and model 
interpolation with held-out data [28, 27]. Another class 
of methods are similarity-based local smoothing tech¬ 
niques as, e.g., proposed by Essen and Steinbiss [14] and 
Dagan et al. [9, 10]. An empirical comparison of smooth¬ 
ing techniques can be found in [7]. 

In information retrieval, there have been essentially 
two proposals to overcome the sparseness problem. The 
first class of methods relies on the cluster hypothesis 
[62, 20] which suggests to make use of inter-document 
similarities in order to improve the retrieval perfor¬ 
mance. Since it is often prohibitive to compute all pair¬ 
wise similarities between documents these methods typ¬ 
ically rely on random comparisons or random fraction¬ 
ation [8]. The second approach focuses on the index 
terms to derive an improved feature representation of 
documents. The by far most popular technique in this 
category is Salton’s Vector Space Model [51, 58, 52] of 
which different variants have been proposed with differ¬ 
ent word weighting schemes [53]. A more recent variant 
known as latent semantics indexing [12] performs a di¬ 
mension reduction by singular value decomposition. Re¬ 
lated methods of feature selection have been proposed 
for text categorization, e.g., the term strength criterion 
[ 66 ], 

In contrast, we propose a model-based statistical ap¬ 
proach and present a family of finite mixture models 
[59, 35] as a way to deal with the data sparseness prob¬ 
lem. Since mixture or class-based models can also be 
combined with other models our goal is orthogonal to 
standard interpolation techniques. Mixture models have 
the advantage to provide a sound statistical foundation 
with the calculus of probability theory as a powerful in¬ 
ference mechanism. Compared to the unconstrained ta¬ 
ble count ‘model’, mixture models offer a controllable 
way to reduce the number of free model parameters. As 
we will show, this significantly improves statistical infer¬ 
ence and generalization to new data. The canonical way 
of complexity control is to vary the number of compo¬ 
nents in the mixture. Yet, we will introduce a different 
technique to avoid overhtting problems which relies on 
an annealed generalization of the classical EM algorithm 
[13]. As we will argue, annealed EM has some additional 
advantages making it an important tool for fitting mix- 

1 Word ra-gram models are examples of such higher order 
co-occurrence data. 



ture models. 

Moreover, mixture models are a natural framework 
for unifying statistical inference and clustering. This is 
particularly important, since one is often interested in 
discovering structure, typically represented by groups of 
similar objects as in pairwise data clustering [23]. The 
major advantage of clustering based on COD compared 
to similarity-based clustering is the fact that it does not 
require an external similarity measure, but exclusively 
relies on the objects occurrence statistics. Since the 
models are directly applicable to co-occurrence and his¬ 
togram data, the necessity for pairwise comparisons is 
avoided altogether. 

Probabilistic models for COD have recently been in¬ 
vestigated under the titles of class-based n-gram models 
[4], distributional clustering [43], and aggregate Markov 
models [54] in natural language processing. All three 
approaches are recovered as special cases in our COD 
framework and we will clarify the relation to our ap¬ 
proach in the following sections. In particular we dis¬ 
cuss the distributional clustering model which has been 
a major stimulus for our research in Section 3. 

The rest of the paper is organized as follows: Section 
2 introduces a mixture model which corresponds to a 
probabilistic grouping of object pairs. Section 3 then fo¬ 
cuses on clustering models in the strict sense, i.e., models 
which are based on partitioning either one set of objects 
(asymmetric models) or both sets simultaneously (sym¬ 
metric models). Section 4 presents a hierarchical model 
which combines clustering and abstraction. We discuss 
some improved variants of the standard EM algorithm 
in Section 5 and finally apply the derived algorithms to 
problems in information retrieval, natural language pro¬ 
cessing, and image segmentation in Section 6. 

2 Separable Mixture Models 

2.1 The Basic Model 

Following the maximum likelihood principle we first 
specify a parametric model which generates COD over 
X x y, and then try to identify the parameters which 
assign the highest probability to the observed data. The 
first model proposed is the Separable Mixture Model 
(SMM). Introducing K abstract classes C a the SMM gen¬ 
erates data according to the following scheme: 

1. choose an abstract class C a according to a distri¬ 
bution 7T„ 

2. select an object X{ E X from a class-specific condi¬ 
tional distribution pi\ a 

3. select an object yj E y from a class-specific condi¬ 
tional distribution qj\ a 

Note that steps 2. and 3. can be carried out indepen¬ 
dently. Hence, X{ and yj are conditionally independent 
given the class C a and the joint probability distribution 
of the SMM is a mixture of separable component distri¬ 


butions which can be parameterized by 2 
K K 

Pij=P(xi, y j ) = '^2Tr a P(x i , yj\a) = ’^2TT a Pi\ a q j \ a . ( 1 ) 

a=l a=l 

The number of independent parameters in the SMM is 
(N + M — l)K — 1. Whenever K <C min{#, M} this is 
significantly less than a complete table with NM entries. 
The complexity reduction is achieved by restricting the 
distribution to linear combinations of K separable com¬ 
ponent distributions. 

2.2 Fitting the SMM 

To optimally fit the SMM to given data S we have to 
maximize the log-likelihood 

N M / K \ 

£ = ( ^2^aPi\aQj\a ) ( 2 ) 

i = l j = 1 \a = l / 

with respect to the model parameters 9 = (ir,p,rq). To 
overcome the difficulties in maximizing a log of a sum, 
a set of unobserved variables is introduced and the cor¬ 
responding EM algorithm [13, 36] is derived. EM is a 
general iterative technique for maximum likelihood esti¬ 
mation, where each iteration is composed of two steps: 

• an Expectation (E) step for estimating the unob¬ 
served data or, more generally, averaging the com¬ 
plete data log-likelihood, 

• and a Maximization (M) step, which involves max¬ 
imization of the expected log-likelihood computed 
during the E-step in each iteration. 

The EM algorithm is known to increase the likelihood 
in every step and converges to a (local) maximum of C 
under mild assumptions, cf. [13, 35, 38, 36]. 

Denote by R ra an indicator variable to represent 
the unknown class C a from which the observation 
(xj( r p yj( r ), r) E S was generated. A set of indicator 
variables is summarized in a Boolean matrix R E 1Z, 
where 

R = l^R = (Rra) ■■ J^Rra = 1 j (3) 

denotes a space of Boolean assignment matrices. R effec¬ 
tively partitions the sample set S into K classes. Treat¬ 
ing R as additional unobserved data, the complete data 
log-hkehhood is given by 
L K 

Pra (log TTq: +logPi( r) \ a + log qj( r )\a) (4) 

r— 1 a=l 

and the estimation problems for 7r, p and q decouple for 
given R. 

The posterior probability of R ra for a given parameter 
estimate 9^ (E-step) is computed by exploiting Bayes’ 
rule and is in general obtained by 

(Rra) = P(Rra = l\0,S) 

cx P(S\6,R ra = l)P(R ra = l\6). (5) 

2 The joint probability model in (1) was the starting point 
for the distributional clustering algorithm in [43], however 
the authors have in fact restricted their investigations to the 
(asymmetric) clustering model (cf. Section 3). 



Thus for the SMM ( R ra ) is given in each iteration by 


JO J*> 


(R r 


\(* +1) _ 




»(r)|a 'ij(r)\ a 


V 

£^1/ = 1 


( 6 ) 


Pi 


'ij [r 


The M-step is obtained by differentiation of (4) using 
(6) as an estimate for R ra and imposing the normaliza¬ 
tion constraints by the method of Lagrange multipliers. 
This yields a (normalized) summation over the respec¬ 
tive posterior probabilities 


and 


t r <*> = 

a 

r=l 

(7) 

-0) 
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T a) 22 ^ Rra ^ 
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= 
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(9) 


Iterating the E- and M-step, the parameters converge 
to a local maximum of the likelihood. Notice, that it is 
unnecessary to store all L ■ K posteriors, as the E- and 
M-step can be efficiently interleaved. 

To distinguish more clearly between the different 
models proposed in the sequel, a representation in terms 
of directed graphical models (belief networks) is utilized. 
In this formalism, random variables as well as parame¬ 
ters are represented as nodes in a directed acyclic graph 
(cf. [41, 32, 17] for the general semantics of graphical 
models). Nodes of observed quantities are shaded and a 
number of i.i.d. observations is represented by a frame 
with a number in the corner to indicate the number of 
observations (called a plate). A graphical representation 
of the SMM is given in Fig. 1. 



Figure 1: Graphical model representation for the sym¬ 
metric parameterization of the Separable Mixture Model 
(SMM). 


2.3 Asymmetric Formulation of the SMM 

Our specification of the data generation procedure and 
the joint probability distribution in (1) is symmetric in 
X and y and does not favor an interpretation of the ab¬ 
stract classes C a in terms of clusters of objects in X or 


y. The classes C a correspond to groups of pair occur¬ 
rences which we call aspects. As we will see in compari¬ 
son with the cluster-based approaches in Section 3 this is 
different from a ‘hard’ assignment of objects to clusters, 
but differs also from a probabilistic clustering of objects. 
Different observations involving the same Xi E X (or 
Vj E y) can be explained by different aspects and each 
objects has a particular distribution over aspects for its 
occurrences. This can be stressed by reparameterizing 
the SMM with the help of Bayes’ rule 

N 

Pi = 22 Pi g 71 "’ and Pa i* = R| " • ( 10 ) 

i =1 Pi 

The corresponding generative model, which is in fact 
equivalent to the SMM, is illustrated as a graphical 
model in Fig. 2. It generates data according to the fol¬ 
lowing scheme: 

1. select an object Xi E X with probability pi 

2. choose an abstract class C a according to an object- 
specific conditional distribution p a \i 

3. select an object yj E T from a class-specific condi¬ 
tional distribution qj\ a 

The joint probability distribution of the SMM can thus 
be parameterized by 

Pij = P{xi,yj) = qj^^Poim* • ( U ) 

a 

Hence a specific conditional distribution qj^ defined on 
y is associated with each object Xi, which can be under¬ 
stood as a linear combination of the prototypical condi¬ 
tional distributions qj\ a weighted with probabilities p a \i 
(cf. [43]). Noticdy that although p a \i defines a proba¬ 
bilistic assignment of objects to classes, these probabili¬ 
ties are not induced by the uncertainty of a hidden class 
membership of object x^ as is typically the case in mix¬ 
ture models. In the special case of X = y the SMM 
is equivalent to the word clustering model of Saul and 
Pereira [54] which has been developed parallel to our 
work. Comparing the graphical models in Fig. 1 and 
Fig. 2 it is obvious that the reparametrization simply 
corresponds to an arc reversal. 

2.4 Interpreting the SMM in terms of Cross 
Entropy 

To achieve a better understanding of the SMM con¬ 
sider the following cross entropy (Kullback-Leibler di¬ 
vergence) D between the empirical conditional distribu¬ 
tion n.j\i = rtij/rti of yq given Xi and the conditional qj^ 
implied by the model, 

M 

D [ n 3\Mi\i\=22 n i\i iogjr 1 ( 12 ) 

i=l 

M i M 

= J2 n i I ’■ —'ll n a ^JlPaimia- 

j= 1 ’’ j= 1 « 

Note, that the first entropy term does not depend on 
the parameters. Let a.(S) = XI?: n i n j\i log and 
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Figure 2: Graphical model representation for the Sepa¬ 
rable Mixture Model (SMM) using the asymmetric rep¬ 
resentation. 

rewrite the observed data log-likelihood in (2) as 

N N 

£ = - ^2 Hi D [ n j\Mj\i] + X] 

8=1 8=1 

Since the estimation of pi = n,; can be carried out in¬ 
dependently, the remaining parameters are obtained by 
optimizing a sum over cross entropies between condi¬ 
tional distributions weighted with np y the frequency of 
appearance of Xi in 5, i.e., by minimizing the cost, func¬ 
tion 

N 

7i = J2 n ’D[nj\i\qj\i] ■ (14) 

«=i 

Because of the symmetry of the SMM an equivalent de¬ 
composition is obtained by interchanging the role of the 
sets X and y. 

2.5 Product-Space Mixture Model 

The abstract classes C a of a fitted SMM correspond to as¬ 
pects of observations, i.e., pair co-occurrences, and can¬ 
not. be directly interpreted as a. probabilistic grouping in 
either data, space. But. often we may want, to enforce a. 
sim ultaneous interpretation in terms of groups or proba¬ 
bilistic clusters in both sets of objects, because this may 
reflect, prior belief or is part, of the task. This can be 
achieved by imposing additional structure on the set. of 
labels to enforce a. product, decomposition of aspects 

{1, • • •, A} = {1,..., Kx } x {1,..., I\y } . (15) 

Each element, a is now uniquely identified with a. multi- 
index {v a ,fi, a ). The resulting Product-Space Mixture 
Model (PMM) has the joint, distribution 

K 

Pij = (16) 

a=l 

Here 7r„ = 7r„ Q|[lQ is the probability to generate an obser¬ 
vation from a. specific pair combination of clusters from 
X and y. The difference between the SMM and the 


PMM is the reduction of the number of conditional dis¬ 
tributions pi\ a and qj\ a which also reduces the model 
complexity. In the SMM we have conditional distribu¬ 
tions for each class C a , while the PMM imposes addi¬ 
tional constraints, pi\ a = Pi\fj, if v a = vp and qj\ a = qj\fj, 
if p a = pp . This is illustrated by the graphical model in 
Fig. 3. 



Figure 3: Graphical representation for the Product. 
Space Mixture Model (PMM). 

The PMM with K,y T’-cla.sses and Ky T-cla.sses is 
thus a. constrained SMM with K = KyKy abstract, 
classes. The number of independent, parameters in the 
PMM reduces to 

(K x K y - 1) + K X (N - 1) + K y (M - 1) . (17) 

Whether this is an advantage over the unconstrained 
SMM depends on the specific data, generating process. 
The only difference in the fitting procedure compared to 
the SMM occurs in the M-st.ep by substituting 

'X (Rra) {t \ ( 18 ) 

r:i(r)=i or.v a —v 

4’ * E E ( R ™Y t] ■ ( 10 ) 

r-jfr)=j «U‘a =;i 

The E-st.ep has to be adapted with respect, to the mod¬ 
ified labeling convention. 

3 Clustering Models 

The grouping structure inferred by the SMM corre¬ 
sponds to a. probabilistic partitioning of the observation 
space X x y. Although the conditional probabilities p a p 
and q a \j can be interpreted as class membership proba¬ 
bilities of objects in X or T, they more precisely corre¬ 
spond to object-specific distributions over aspects. Yet., 
depending on the application at. hand it. might, be more 
natural to assume a. typically unknown, but. nevertheless 
definitive assignment, of objects to clusters, in particular 
when the main interest, lies in extracting grouping struc¬ 
ture in X and/or y as is often the case in exploratory 
data, analysis tasks. Models where each object, is as¬ 
signed to exactly one cluster are referred to as clustering 
models in the strict, sense and they should be treated as 
models in their own right.. As we will demonstrate they 
have the further advantage to reduce the model com¬ 
plexity compared to the a.spftct.-ba.sed SMM approach. 






3.1 Asymmetric Clustering Model 

A modification of the SMM leads over to the Asymmet¬ 
ric Clustering Model (ACM). In the original formulation 
of the data generation process for the SMM the assump¬ 
tion was made that each observation (xi, yj) is generated 
from a class C a according to the class-specific distribu¬ 
tion Pi\ a qj\ a or, equivalently, the conditional distribu¬ 
tion qj\i was a linear combination of probabilities qj\ a 
weighted according to the distribution p a \i- Now we re¬ 
strict this choice for each object Xi to a single class. This 
implies that all yj occuring in observations ( Xi,yj ) in¬ 
volving the same object Xi are assumed to be generated 
from an identical conditional distribution qj\ a - Let us 
introduce an indicator variable p a for the class member¬ 
ship which allows us to specify a probability distribution 

by 


K 

P(xi,yj\I,p,q) = Pi^2liaqj\a ■ (20) 

a = l 


The ACM can be understood as a SMM with p a re¬ 
placing p a \i- The model introduces an asymmetry by 
clustering only one set of objects X , while fitting class 
conditional distributions for the second set y. Obvi¬ 
ously, we can interchange the role of X and y and may 
obtain two distinct models. 

For a sample set S the log-likelihood is given by 


N 


N K 


M 


c = i°g Pi + YY Iia Y n ij lo g qj\c 


( 21 ) 


2 = 1 


2 = 1 a = l 


3= 1 


The maximum likelihood equations are 

Pi = rii/L , (22) 

1 if a = argmin^ D[npi\qp v } 

0 else, 


(23) 


dj | a — 


Mi = 1 1 ia n ij 

j 

Z^z -1 Ua n i 


N 




/ -j j 3 

2=1 l^k = 1 -t-karik 


\i ■ (24) 


The class-conditional distributions qj\ a are linear super¬ 
positions of all empirical distributions of objects Xi in 
cluster C a . Eq. (24) is thus a simple centroid condi¬ 
tion, the distortion measure being the cross entropy or 
Kullback-Leibler divergence. In contrast to (13) where 
the maximum likelihood estimate for the SMM mini¬ 
mizes the cross entropy by fitting Pi\ a , the cross entropy 
serves as a distortion measure in the ACM. The update 
scheme to solve the likelihood equations is structurally 
very similar to the 77-means algorithm: calculate as¬ 
signments for given centroids according to the nearest 
neighbor rule and recalculate the centroid distributions 
in alternation. 

The ACM is in fact similar to the distributional clus¬ 
tering model proposed in [43] as the minimization of 


N K 

n = YY IivD\rij\i\qj\ v \ . (25) 

2=1 U=1 


the fact that the centroid equation (24) is satisfied at 
stationary points 3 . Yet, since after dropping the inde¬ 
pendent pi parameters and a data dependent constant 
we arrive at 

N K 

P — D\rij\i\qj |ct] 7 ( 2 b) 

2=1 a =1 

showing that the choice of the KL-divergence simply fol¬ 
lows from the likelihood principle. We like to point out 
the non-negligible difference between the distributional 
clustering cost function in (25) and the likelihood in (26), 
which weights the object specific contributions by their 
empirical frequencies n 8 -. This implies that objects with 
large sample sets Si have a larger influence on the opti¬ 
mization of the data partitioning, since they account for 
more observations, as opposed to a constant influence in 
the distributional clustering model. 


3.2 EM algorithm for Probabilistic ACM 

Instead of interpreting the cluster memberships 7 sa as 
model parameters, we may also consider them as unob¬ 
served variables. In fact, this interpretation is consistent 
with other common mixture models [35, 59] and might 
be preferred in the context of statistical modeling, in 
particular if N scales with L. Therefore consider the 
following complete data distribution 


P{S,I\p,p,q) = 

P(S\I,p,q)P(I\p), 

AT 

(27) 

P(I\p) = 

1\ 

n p° a ■ 

(28) 


8=1 


Here p specifies a prior probability for the hidden vari¬ 
ables, P(I ia = 1| p) = p a . 

The introduction of unobservable variables 7 sa yields 
an EM-scheme and replaces the arg min-evaluation in 
(23) with posterior probabilities 


(■ Iia) (t+1) = 


s(‘)tt m (AC 
Pa llj=l ylj\a 


Ef =1 pP n f =1 («# 

pEexp (-niD^njiilq^lYj 
Ef=iZ } exp (-riiD npi\q^l 


(29) 

• (30) 


The M-step is equivalent to (24) with posteriors replac¬ 
ing Boolean variables. Finally, the additional mixing 
proportions are estimated in the M-step by 


P 


M - 


N 




2 = 1 


(31) 


Notice the crucial difference compared to SMM poste¬ 
riors in ( 6 ): Since all indicator variables R ra belonging 
to the same Xi are identified, the likelihoods for obser¬ 
vations in Si are collected in a product before they are 
suitably normalized. To illustrate the difference consider 
the following example. Let a fraction s of the observed 


In distributional clustering, the KL-divergence as a dis¬ 
tortion measure for distributions has been motivated by 


5 


3 This is in fact not a unique property of the KL 
divergence as it is also satisfied for the Euclidean distance. 



pairs involving Xi be best explained by assigning Xi to 
C a while the remaining fraction 1 — s of the data is best 
explained by assigning it to C v . Fitting the SMM model 
approximately results in p a \i = s and p v \i = 1 — s, ir¬ 
respective of the number of observations, because pos¬ 
teriors (I ra ) are additively collected. In the ACM all 
contributions first, enter a huge product. In particular, 
for r>i —*■ oo the posteriors (/,;„} approach Boolean val¬ 
ues and automatically result in a hard partitioning of 
X. Compared to the original distributional clustering 
model as proposed in [43] our maximum likelihood ap¬ 
proach naturally includes additional parameters for the 
mixing proportions p. 

Notice that in this model, to which we refer as prob¬ 
abilistic ACM, different observations involving the same 
object Xi are actually not independent, even if condi¬ 
tioned on the parameters ( p,p,q ). As a consequence, 
considering the predictive probability for an additional 
observation s = ( x^ , yj , L + 1) requires to condition on 
the given sample set N, more precisely on the subset Si, 
which yields 


K 

P(s\S, p,p, q) = ^2 P(s\Iia=l,P,1) P(Ii a =l\Si,p, q) 

a — 1 

K 

— Pi ^ ' qj\a(Jiot) • (32) 

a = l 

Thus we have to keep the posterior probabilities in ad¬ 
dition to the parameters (p, q) in order to define a (pre¬ 
dictive) probability distribution on co-occurrence pairs. 
The corresponding graphical representation in Fig. 4 
stresses the fact that observations with identical X— 
objects are coupled by the hidden variable 4 . 



Figure 4: Graphical representation for the Asymmetric 
Clustering Model (ACM). 


4 Not.ice that n, is not interpreted as a random variable, 
but treated as a given quantity. 


3.3 Symmetric Clustering Model 

We can classify the models discussed so far w.r.t. the 
way they model p a \i’ (i) as an arbitrary probability dis¬ 
tribution (SMM), (ii) as an unobserved hidden variable 
(p a \i = (Iia), probabilistic ACM), (iii) as a Boolean vari¬ 
able (p a \i = Iia G {0, 1}, hard ACM). On the other 
hand, no model imposes restrictions on the conditional 
distributions qj\ a - This introduces the asymmetry in 
ACM, where p a \i and hence pi\ a is actually restricted, 
while qj\ a is not. However, it also indicates a way to 
derive symmetric clustering models, namely by impos¬ 
ing the same constraints on conditional distributions pi\ a 
and qj\ a - Unfortunately, naively modifying the SMM for 
both object sets simultaneously does not result in a rea¬ 
sonable model. Introducing indicator variables and 
■Jja to replace p a \i and q a y yields a joint probability 

K 

Pij oc piqj y ' C. Jja ~a (33) 

a = 1 

which can be normalized to yield a valid probability dis¬ 
tribution, but which is zero for pairs (xi,yj), whenever 
^2 Iia Jja = 0 and therefore results in zero probabilities 
for each co-occurrence of pairs not belonging to the same 
cluster C a - 

For the PMM, however, the corresponding clustering 
model is more interesting. Let us introduce cluster asso¬ 
ciation parameters c v ,, and define the joint probability 
distribution of the symmetric clustering model (SCM) by 

Pij — Pi qj y ' h n -lj //i (34) 


where we have to impose the global normalization con¬ 
straint. 

N M N M 

EE Pij = EE Pi qj y ' hu -lj ii ( j/ji — 1 . (35) 

j=l j=l i=l j = l 1/ , j-i 

In the sequel, we will add more constraints to break cer¬ 
tain invariances w.r.t.. multiplicative constants, but. for 
now no restrictions besides (35) are enforced. 

Introducing a, Lagrange multiplier A results in the fol¬ 
lowing augmented log-likelihood 

C = ^ n-ij ^ liv Jjn (log Pi + log qj + log c vfl ) 

ij 

+ A I ^ ^ Pi Qj ^ ^ Ijv Jj jj,Cy n 1 I • (36) 

\ ®j j 


The corresponding stationary equations for the maxi¬ 
mum likelihood estimators are 


rij 


Pi = 
Qj = 




c p u Qj ^ 


v,ii 
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3 V 


^ Jjn c vii PiUv 

l^i = 1 2—/j =1 n ij 1 iv'Jjn 

MT/ZiPiii^iT/tLiqjJjn)' 


= E 


i 3 j 


(37) 

(38) 

(39) 




together with (35) which is obtained from (36) by differ¬ 
entiation with respect to A. Substituting the right-hand 
side of (39) into (35) results in the following equation for 

A, 


A — riij UvJjn — L . (40) 

i,j v,n 


Inserting A = —L into (39) gives an explicit expression 
for c vfJ which depends on p and q. Substituting this 
expression back into (37,38) yields 


Pi 



l^k ^kv 


(41) 


and an equivalent expression for qj. Observing that the 
fraction on the right-hand side in (41) does not depend 
on the specific index i, the self-consistency equations can 
be written as pi = rn ^ hv^v It is straightforward to 
verify that: (i) any choice of constants a v £ IR + gives a 
valid solution and (ii) each such solution corresponds to 
a (local) maximum of C. This is because a simultaneous 
re-scaling of all pi for which Ii v = 1 is compensated by a 
reciprocal change in c vfJ as is obvious from (34) or (39), 
leaving the joint probability unaffected. 

We break this scale invariance by imposing the ad¬ 
ditional conditions ^iPihv = 7r(; = JA niCv/T and 
JA QjJjfj = 7r^ = JA nijJjfi/L on the parameters. This 
has the advantage to result in the simple estimators 
pi = rii/L and qj = rrij/L, respectively. The proposed 
choice decouples the estimation of p and q from all other 
parameters. Moreover it supports an interpretation of pi 
and qj in terms of marginal probabilities, while 7r(; and 
7r^ correspond to occurrence frequencies of objects from 
a particular cluster. With the above constraints the final 
expression for the maximum likelihood estimates for c vfl 
becomes 

c V n = ~^TT’ where ( 42 ) 

N M 

= Y.Y.'-t 1 ( 43 ) 

i= 1j=l 


7 T vfl can be interpreted as an estimate of the joint proba¬ 
bility for the cluster pair (v, p). Notice that the auxiliary 
parameters are related by marginalization ^ 
and 'Yln,' K viJ = p- The maximum likelihood estimate 
c vfl is a quotient of the joint frequency of objects from 
clusters C% and and the product of the respective 
marginal cluster frequencies. If we treat c vfl as func¬ 
tions of I, J and insert (42) into (36), this results in 
a term which represents the average mutual informa¬ 
tion between the random events Xi £ C x and yj £ . 

Maximizing C w.r.t. I and J thus maximizes the mutual 
information which is very satisfying, since it gives the 
SCM a precise interpretation in terms of an information 
theoretic concept. A similar criterion based on mutual 
information has been proposed by Brown et al. [4] in 
their class-based n-gram model. More precisely their 
model is a special case of the (hard clustering) SCM, 
where formally X = y and Ii v = J 8 '„. 5 


The coupled K-means like equations for either set of 
discrete variables are obtained by maximizing the aug¬ 
mented likelihood in (36) from which we deduce 

JW-{ J ,„„h (44) 


L 


V /J 


faiv — ^ ^ ^ ^ Jj n 

j V 

The expression for hi v further simplifies, because 


Y.K: 


‘ V /J 




‘ V /J 


= rii (45) 


E n i m j T ~ 

J J 3n c vn : 

is a constant independent of the cluster index v which 
can be dropped in the maximization in (44). It has to 
be stressed that the manipulations in (45) have made 
use of the fact that the J-variables appearing in (42) 
and (45) can be identified; the c-estimates have thus to 
be J-consistent 6 . This can be made more plausible by 
verifying that for given J and J-consistent parameters, 
the marginalization JA pij = pi holds independently of 
the specific choice of I. This automatically ensures the 
global normalization since JA pi = 1 which in turn ex¬ 
plains why the global constraint does not affect the op¬ 
timal choices of I according to (44). Similar equations 
can be obtained for J, 

f _ /1 if a = arg max,, J2i n ij £„ hu log c vfl 
~ I 0 else 


The nature of the simultaneous clustering suggests an al¬ 
ternating minimization procedure where the T-partition 
is optimized for a fixed ^-partition and vice versa. Af¬ 
ter each update step for either partition the estimators 
c vfJ have to be updated in order to ensure the validity of 
(45), i.e., the update sequence (J, c, J, c) is guaranteed 
to increase the likelihood in every step. Notice that the 
cluster association parameters effectively decouple the 
interactions between assignment variables /,■„, Ik v for 
different Xi, Xk and between assignment variables J J|U , 
J/ M for different yj , yj. Although we could in principle 
insert the expression in (36) directly into the likelihood 
and derive a local maximization algorithm for I and J 
(cf. [4]), this would result in much more complicated 
stationary conditions than (44). The decoupling effect is 
even more important for the probabilistic SCM derived 
in the next section. 


3.4 Probabilistic SCM 


As for the ACM, we now investigate the approach pre¬ 
ferred in statistics and treat the discrete I and J as hid¬ 
den variables. The situation is essentially the same as 
for the ACM. The complete data distribution for the 
probabilistic SCM (PSCM) is given by 


p(s,i, j\ P *y,c) 


IT 

i,j v,n 


(46) 


classes are implicitly utilized in two different ways: as classes 
for predicting and for being predicted. It is not obvious that 
this is actually an unquestionable choice. 

6 Hereby we mean that depends on J by the formula in 
(42), where I can be an arbitrary partitioning of the T-space. 


5 For the bigram model in [4] this implies that the word 
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n pty, 1 ^ 


n / Y.- h i r ' ] 


and the predictive probability for s = ( Xi , yj , L + 1) now 
involves joint posteriors, 


P(s\S, ...) = pi qj £ ij-iv J j jU } Cv n • (47) 

v,n 


The M-step equations are obtained from (42) by replac¬ 
ing (products of) Boolean variables by their posteriors. 
The estimates for the p-parameters are given by 


> V AT ’ r y ( 1 JT ■ v 


N 


M 


The coupling of I and J makes the exact computa¬ 
tion of posteriors in the E-step intractable. To preserve 
tractability of the procedure we propose to apply a fac¬ 
torial approximation (called mean-held approximation, 
cf. Appendix A), ~ {hv){Jjy), which results in 

the following approximations for the marginal posterior 
probabilities 


{I iv ) oc p x v exp 


n ij i°§ Cvp- 

j m 


(49) 


and a similar equation for (Jjfj). The mean-held equa¬ 
tions can be more intuitively understood as a soft-max 
versions of the hard clustering equations with additional 
priors p x , p y . Alternatively, one may also apply Markov 
chain Monte Carlo (MCMC) methods to approximate 
the required correlations. Yet, the mean-held approxi¬ 
mation has the advantage to be more efficient due to its 
deterministic nature. Notice that the mean-held condi¬ 
tions (49) form a highly non-linear, coupled system of 
equations. A solution is found by a hxed-point iteration 
which alternates the update of posterior marginals with 
an update of the continuous parameters ((7), ( c, p x ), (J), 
( c,p y ) sequence). This optimizes a common objective 
function in every step and always maintains a valid prob¬ 
ability distribution. 7 


3.5 Overview 

Altogether we have derived six different model types for 
COD 8 which are summarized in the following systematic 
scheme in Table 1. As can be seen the models span a 
large range of model complexity by imposing constraints 
on the class conditional distributions pi\ a and qj\ a - 


Model 

Pi\ a 

a = (v, y) 

Qj\a 

a = (v, y) 

SMM 

unconstr. 

unconstr. 

PMM 

Pi\ v 

1i\y 

(hard) ACM 

liotPi 

7T c* 

unconstr. 

(probabilistic) ACM 

(lic)Pi 

7T & 

unconstr. 

(hard) SCM 

livPi 

IT* 

Jip. Qj 
< 

(probabilistic) SCM 

(hv)Pi 

IT* 

{Jiti)Qi 

< 


Table 1: Systematic overview of presented COD models. 


T on the clusters to be given, e.g., a complete binary 
tree. Clusters C a are identihed with the terminal nodes 
ofT. Conditional probabilities qj\ a are not only attached 
to the leaves, but also to all inner nodes of the tree. The 
HACM involves two stages. In the hrst step, which is 
similar to the ACM case, each object Xi is assigned to 
one component or cluster C a represented by an (unob¬ 
served) variable Ii a . Now, instead of generating all rii 
observations from the conditional distribution qj\ a , a sec¬ 
ond probabilistic sampling step is involved by selecting a 
resolution or abstraction level. Hence, the Hrst step is a 
selection from a horizontal mixture of possible paths for 
each object Xi, while the second step is a probabilistic se¬ 
lection of a component from the vertical mixture of nodes 
on the selected path for each observation. To model the 
vertical selection we introduce a second set of hidden 
variables V rv which encode the resolution level A v for 
the r-th observation. Notice the different nature of both 
sets of variables: the I variables represent a partitioning 
in the X space similar to the ACM, while V partitions 
the co-occurrence space X x y like in the SMM. Obvi¬ 
ously, the hidden variables are not independent and have 
to fulfill the following set of constraints induced by T: 


4 Hierarchical Clustering Model 

4.1 Model Specification 

In this section we present a novel hierarchical generative 
model, called HACM. Assume therefore a tree topology 

7 The EM approach for SCM has problems when started 
from a random initialization, because all posteriors typically 
approach uniform distributions. A remedy which we applied 
is to utilize a solution of the ACM to initialize one arbitrarily 
selected set of hidden variables. 

8 We ignore the variants obtained from asymmetric models 
by reversing the role of X and y for simplicity. 


£ hr) a Vrv = 1, Vr, (50) 

a A „1C a 

where A v ] C a denotes the nodes A v ‘above’ C a , i.e., 
nodes on the path to C a . The constraints ensure that 
once X{ is assigned to a cluster C a all observation in Si 
are restricted to be generated from one of the abstraction 
nodes A v on the path above C a . A pictorial represen¬ 
tation can be found in Fig. 5: for an object Xi assigned 
to C a the choices for abstraction levels of observations 
( Xi,yj,r ) are restricted to the ‘active’ (highlighted) ver¬ 
tical path. The ‘tension’ in this model is imposed by the 
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Figure 5: Scheme for data generation with the HACM. 


constraints on In r ) a V rv as opposed to the independent 
choices R ra in the SMM. 

To complete the specification of the HACM one has 
to specify prior probabilities for V rv . The most general 
choice is to condition the priors on the terminal node 
selected by the I variables and on aq itself, Le., to intro¬ 
duce parameters t u uy. This assumes that each object 
has a specific distribution over abstraction levels. For 
simplicity, the constraints in (50) are incorporated in 
the prior by setting r v \ a i = 0 whenever A v j C a . With 
the above definitions the complete data log-likelihood of 
the HACM is given by 

T — ^ ' riij ^ ' Iia ^ ^ ' fry log T u |o,y Qj\u 

i,j a Si v 

+ E n i l °m + E E Iic,\ogp a . (51) 

i i a 


The corresponding representation in terms of a graphical 
model is shown in Fig. 6. 

We may think of the HACM as a mixture model with 
a horizontal mixture of clusters and a vertical mixture 
of abstraction levels. Each horizontal component is a 
mixture of vertical components on the path to the root, 
vertical components being shared by different horizontal 
components according to the tree topology. The HACM 
is more general than the ACM in that it allows to use a 
different abstraction level for each observation. The class 
conditional probability qj\ a is now modeled by a vertical 
mixture offering additional degrees of freedom. In fact 
the ACM is retrieved as a special case by setting r v \ a = 
8 uct . On the other hand, the HACM is more constrained 
than the SMM since the mixing of component densities 
qj\ a is restricted to nodes on a single path through the 
tree. This restriction precisely expresses why we obtain a 
hierarchical clustering organization of objects in X and, 
simultaneously, an abstractive organization of objects in 
y. The dual organization is in particular interesting for 
information retrieval where the HACM can be used to 
cluster documents (= X) in a hierarchical fashion and 
simultaneously assign abstraction levels to the different 
keywords (= T). 


4.2 EM Algorithm for the HACM 

Skipping the derivation of the maximum likelihood equa¬ 
tions for the hard clustering case of HACM, we directly 
continue with the probabilistic EM version. 

In the E-step we need to calculate posterior probabil¬ 
ities (Ii{ r ) a V rv ). Since the values of the clustering vari¬ 
ables Ii a restrict the admissible values of V rv , we may 
compute the joint posterior probabilities from the chain 
rule 


P(IHr)aV r u= l|5,fi) =P(V rv = 1| Si {r) ,9,I i{r)a = 1) 

P{Ii{r)a= ll'Sq,.), 0), (52) 

where 6 = {p,q, p,r) summarizes all continuous param¬ 
eters. The conditional posterior probabilities for V rv are 
given by 


(Vrv\a) (t + 1) 


T it) At) 
v\a,i(r)*j (r)\v 


E 




.,(*) 


/i ii\a,Ur)^j{r)\ii 


(53) 


The marginal posteriors can by definition be computed 
from 


( Iia)=Y P &<* = 

m 

which yields 

Si V 


(54) 


(55) 


completing the E-st.ep. 

In the M-st.ep all parameters are updated according 
to the following set of equations 


M = 

i 

(56) 


E E< F -i«) u E«) (t) 

(57) 


r'-j (r )=j a 


T l\l,i = 

W\^ {Vr ^ )m 

(58) 


Using the HACM in large-scale applications requires 
a closer investigation of the computational complexity 












Figure 6: Graphical representation for the Hierarchical 
Asymmetric Clustering Model (HACM). 

of the EM model fitting procedure. The major com¬ 
putational burden of the algorithm is the re-calculation 
of posterior probabilities {V rv \ a ) in the E-step. To ac¬ 
celerate the EM algorithm we exploit the fact that the 
hierarchical and abstractive organization does not criti¬ 
cally depend on the parameters r v \ a By this we mean, 
that even setting r v \ a y = const, (for all A v | C a ) will 
result in a reasonable model. In fact, to learn .zq-specific 
distribution of the vertical mixtures is more of a fine tun¬ 
ing which may improve the model performance once the 
essential structure has been identified. An intermediate: 
choice- which worked well in our experiments is to set 
T v \ a j = T v \ a , i.e., to introduce priors which are shared 
by all objects Xi belonging to the same cluster C a . The 
simplified E-st.ep has the advantage that {V rv \ a ) does not 
depend on Moreover it reduces the model complex¬ 

ity and may prevent overfitting. In the simplified model, 
instead of computing posteriors for all L observations, it 
suffices to compute posterior probabilities for all M y~ 
objects. This can result in a significant acceleration, for 
example, in natural language modeling where M would 
be the size of the vocabulary as compared to the size R 
of the corpus of word occurrences which typically differ 
by several orders of magnitude. For t v \ a = const, an 
additional speed-up results from a simplified propaga¬ 
tion of posteriors in the tree. In our simulation we have 
thus pursued a three-stage strategy, where the degrees 
of freedom are incrementally increased. 

Like for the ACM it is straightforward to check that 
the predictive probabilities for s = (, yj , L + 1) are 
given by 

— j); ' Py\j ! lj | u - Pv\j — ^ ' J (Iia)l~u\a > i • (59) 

v a 

4.3 Hierarchies and Abstraction 

With other hierarchical mixture models proposed for 
supervised [3, 29] and unsupervised learning [37] the 
HACM shares the organization of clusters in a tree struc¬ 
ture. It extracts hierarchical relations between clusters, 
i.e., it breaks the permutation-symmetry of the cluster 
labeling. Even more important, however, it is capable 
to perform statistical abstraction. Observations which 
are common to all clusters in the subtree rooted at an 


inner node are preferably captured at the level of general¬ 
ity represented by that node. Observations being highly 
specific are ‘explained’ on the terminal level. Therefore 
inner nodes do not represent a coarser view on the data 
which could be obtained, e.g., by a weighted combina¬ 
tion of the distributions at successor nodes and as might 
be expected for a hierarchical model. 9 The vertical mix¬ 
tures perform a specialization in terms of the level of gen¬ 
erality most adequate for each of the observations. Thus 
the HACM incorporates a novel notion of hierarchical 
modeling, which differs from multiresolution approaches, 
but also from other hierarchical concepts of unsupervised 
learning (e.g. [11]). It offers several new possibilities in 
data analysis and information retrieval tasks like extract¬ 
ing resolution-dependent, meaningful keywords for sub- 
collection of documents and gives a satisfying solution 
to the problem of cluster summarization (cf. [8]) since 
it explicitly finds the most characteristic terms for each 
(super-)clust.er of documents. 

There are several additional problems which have to 
be solved to arrive at a complete algorithm for the 
HACM. The most important concerns the specification 
of a procedure to obtain the t.ree-t.opology T. Before 
explaining our heuristic we have to introduce the impor¬ 
tant concept of annealing. 

5 Improved EM Variants 

5.1 Annealed EM 

So far we have mainly focused on the modeling prob¬ 
lem of defining mixture models for COD. The standard 
model fitting procedure has been the EM algorithm and 
its hard clustering variants. We now discuss two im¬ 
portant problems which naturally occur in this context. 
The first, problem is to avoid unfavorable local maxima, 
of the log-likelihood. The second, even more important, 
problem is to avoid overfit.t.ing, i.e., maximize the per¬ 
formance on unseen future data. The framework which 
allows us to improve the presented EM procedures in 
both aspects is known as deterministic annealing. 

Deterministic annealing has been applied to many 
clustering problems, including vectorial clustering [49, 
50, 5], pairwise clustering [23], and in the context, of 
COD for distributional clustering [43]. The key idea, is 
to introduce a. temperature parameter T and to replace 
the minimization of a. combinatorial objective function 
by a. substitute known as the free energy. Details on 
this topic are given in Appendix A. Here, we present, an¬ 
nealing methods without, reference to statistical physics. 
Consider therefore the general case of maximum likeli¬ 
hood estimation by the EM algorithm. The E-st.ep by 
definition computes a. posterior average of the complete 
data, log-likelihood which is maximized in the M-st.ep. 
The annealed E-st.ep at. temperature T performs this av¬ 
erage w.r.t. a. distribution which is obtained by generaliz¬ 
ing Bayes’ formula, such that, the likelihood contribution 
is taken to th# power of 1/T. For T > 1 this amounts 


9 In fact, it is also important, to develop hierarchical gen¬ 
eralizations of the kind. However, we focus on the HACM 
which is more specific to COD. 
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to increasing the effect of the prior which in turn will re¬ 
sult in a larger entropy of the (annealed) posteriors. For 
example, in the case of the ACM the annealed E-step 
generalizing (21) is given by 10 



2. it avoids overhtting by discounting the likelihood 
contribution in the E-step 

3. it offers a physically motivated heuristic to produce 
a meaningful tree topology (for the HACM). 

In all experiments and for all statistical models we have 
therefore utilized the annealed variant of EM. 

5.2 Predictive EM 


For (hard) clustering applications deterministic anneal¬ 
ing is utilized in its usual T —>■ 0 limit. Although there 
is no guarantee that deterministic annealing in general 
finds the global minimum, many independent empirical 
studies indicate that the typical solutions obtained are 
often significantly better than the corresponding ‘unan¬ 
nealed’ optimization. This is due to the fact that an¬ 
nealing is a homotopy method, where the (expected) like¬ 
lihood as a cost function is smoothed for large T and is 
recovered in the limit T —>■ 1. 

In addition, for fixed T > 1 the annealed E-step per¬ 
forms a regularization based on entropy, because the pos¬ 
terior probabilities minimize the generalized free energy 
at T = 1 which balances expected costs and (relative) 
entropy [38] (cf. Appendix A). This is the reason why 
annealed EM not only reduces the sensitivity to local 
minima but also controls the effective model complexity. 
It thereby has the potential to improve the generaliza¬ 
tion for otherwise overhtting models. The advantages of 
deterministic annealing are investigated experimentally 
in Section 6. 

In addition, the annealed EM algorithm offers a way 
to generate tree topologies. As is known from adaptive 
vector quantization [49], starting at a high value of T 
and successively lowering T leads through a sequence of 
phase transitions. At each phase transition the effective 
number of distinguishable clusters grows until some max¬ 
imal number is reached or the annealing is stopped. This 
suggests a heuristic procedure where we start with a sin¬ 
gle cluster and recursively split clusters. In the course of 
the annealing one keeps track of the splits and uses the 
‘phase diagram’ as a tree topology T. Note that merely 
the tree topology is successively grown, while the data 
partition obtained at a specific temperature is regrouped 
and may drastically change during the annealing process. 

To summarize, annealed EM solves three problems at 
once: 

1. It avoids unfavorable local minima by applying a 
temperature based continuation method, as the 
modified likelihood becomes convex at high tem¬ 
perature, 

10 Eq. (60) differs from the original formula by Pereira et 
al. [43] in that it scales the temperature with the frequency n % 
and includes the mixing proportions. As pointed out before 
this is naturally obtained in the ML framework, while in the 
distributional clustering cost function (25) the weights n t are 
not considered. Canceling these weights may be a reasonable 
approach to limit the effect of frequently observed ‘objects’ x t 
on the organization of clusters. From a statistical viewpoint, 
however, the latter is implausible, because more observations 
should automatically sharpen the posterior distributions at a 
given temperature level T. 


Another modification of the E-step to improve gener¬ 
alization is worth considering. For notational simplic¬ 
ity focus on the E-step in the SMM as given by (6). If 
the parameters are eliminated by substituting them with 
their current M-step estimators in terms of posteriors as 
given by (7-9), we arrive at 


(Rra) (t + 1) CX 


L Yu (Ru «) ( '*' ) 


(61) 


where i = i(r) and j = j(r). Eq. (61) reveals that in 
estimating ( R ra ) in the E-step its old estimator actu¬ 
ally appear on the right hand side for u = r. This 
has the effect to systematically overestimate high pos¬ 
terior probabilities while small posteriors are underes¬ 
timated. This positive feedback on the posteriors may 
lead to substantial overhtting phenomena. To illustrate 
this problem consider the extreme case, where rii = 1, 
e.g., (xi, yi, r) £ S is the only observation with X{. Then 
the stationary condition for the E-step is given by 


( R r a) — 


{Rra)dj\c 


Ya'(-^ ra ') ( lj\c 


(62) 


and hence qj\ a = Y a ' (Rra')<lj\a l whenever (Ri a ) > 0 
which is only fulfilled if ( R ra ) = 1 (being a stable solu¬ 
tion for a = argmax a < <tj\ a ')- For sparse data the diago¬ 
nal contribution can thus be dominant and the positive 
feedback bears the risk of overhtting. 

In order to overcome these problems we propose a 
variant of EM which we refer to as predictive EM. The 
only modihcation is to exclude the r-th observation in 
recalculating posteriors ( R ra ). The class membership of 
the r-th observation is predicted based on the remain¬ 
ing samples. 11 For the SMM this implies that diagonal 
contributions are excluded in (61). It is obvious that the 
proposed correction does only have a minor inhuence on 
the computational complexity of the htting procedure. 
Despite the heuristic havor caused by modifying an al¬ 
gorithmic step, the predictive EM can be motivated from 
strict optimization principles. Further details and con¬ 
vergence considerations are given in Appendix B. 

In conclusion we would like to stress the fact, that 
although the positive feedback occurs in other EM al¬ 
gorithms, it is most severe for COD models, because of 
the inherent sparseness problem. Furthermore, other er¬ 
ror measures like the squared error between a data vector 
and a cluster centroid are far less sensitive to this type of 
problems then is the cross entropy involving logarithms 
of small probabilities. 

11 This may result in undefined posterior probabilities due 
to zeros, as in the above example. In this case we assume the 
posterior to be uniform. 



5.3 Accelerated EM 

EM algorithms have important advantages over 
gradient-based methods, however for many problems the 
convergence speed of EM may restrict its applicability 
to large data sets. A simple way to accelerate EM al¬ 
gorithms is by overrelaxation in the M-step. This has 
been discussed early in the context of mixture models 
[44, 45] and was ‘rediscovered’ more recently under the 
title of EM(rj) in [1], We found this method useful in 
accelerating the fitting procedure for all discussed mod¬ 
els. Essentially the estimator for a generic parameter 9 
in the M-step is modified by 

9 {t+ ^ = (1 - ri)9^ + rj9 (t+1) , (63) 

where 0^ + is the M-step estimate, i.e., ry = 1 is the 
usual M-step. Choosing 1 < rj < 2 still guarantees con¬ 
vergence, and typically r) & 1.8 has been found to be 
a good choice to speed up convergence. In case that 
a constraint is violated after performing an overrelaxed 
M-step, the parameter set is projected back on the ad¬ 
missible parameter space. For an overview on more elab¬ 
orated acceleration methods for EM we refer to [36]. 

5.4 Multiscale Optimization 

Multiscale optimization [21, 46] is an approach for ac¬ 
celerating clustering algorithms whenever a topologi¬ 
cal structure exists on the object space(s). In image 
segmentation, for example, it is a natural assumption 
that adjacent image sites belong with high probabil¬ 
ity to the same cluster or image segment. This fact 
can be exploited to significantly accelerate the estima¬ 
tion process by maximizing over a suitable nested se¬ 
quence of variable subspaces in a coarse-to-Hne manner. 
This is achieved by temporarily tying adjacent sites in 
a joint assignment variable. For notational convenience 
we again restrict the presentation to the ACM, while ex¬ 
tensions to the SCM and both probabilistic variants are 
straightforward. 12 

More formally a coarsening hierarchy for X is given by 
a nested sequence of equivalence relations M^ over X , 
where MS 1 ^ C and Ad*- 0 - 1 = {(xi,Xi) : X{ E X}. 

In the context of image analysis these equivalence rela¬ 
tions typically correspond to multi-resolution pixel grids 
obtained by subsampling. The log-likelihood in (21) is 
minimized at coarsening level l by imposing constraints 
of the form Ii a = Ij a whenever ( Xi,Xj ) E . For all 
models under consideration this effectively amounts to 
reducing the number of indicator functions to one set of 
variables for each equivalence class in M^, while pre¬ 
serving the functional form of the likelihood, thus en¬ 
abling highly accelerated optimization at coarser levels. 
Once the maximization procedure at a resolution level l 
is converged, the optimization proceeds at the next level 
/ — 1 by prolongating the found solution in M^ to the 
subset defined by AW -1 - 1 , thus initializing the optimiza¬ 
tion at level l — 1 with the solution at level l. For the 
probabilistic version multiscale optimization amounts to 

12 Multiscale optimization in its current form is not appli¬ 
cable to the SMM/PMM. 
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0.14 

471 

0.60 

543 

32 

0.83 

386 

0.07 

452 

0.12 

438 

0.53 

506 

64 

0.79 

360 
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0.48 

477 

128 
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0.04 
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0.45 
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1 

- 
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- 

- 

- 

- 

- 

- 
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0.73 
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0.08 
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0.13 
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0.55 
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0.72 
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0.07 
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0.10 
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0.51 

335 

32 

0.71 
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0.07 
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0.08 

226 

0.46 

286 

64 

0.69 
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0.07 

223 

0.07 

204 

0.44 

272 

128 

0.68 

166 

0.06 

231 

0.06 

179 

0.40 

241 


Table 2: Perplexity results for different models (SMM, 
ACM, HACM, SCM) on two data sets (CRAN: predict¬ 
ing words conditioned on documents, PENN: predicting 
nouns conditioned on adjectives) based on ten-fold cross 
validation (K x = K y = K for SCM). 

modifying the E-step of the EM algorithm by computing 
posteriors for the reduced sets of indicator functions. 

We like to emphasize that in contrast to most mul¬ 
tiresolution optimization schemes, multiscale optimiza¬ 
tion has the advantage to maximize the original log- 
likelihood at all resolution levels. It is only the set of 
hidden variables which is effectively reduced by impos¬ 
ing the constraints on the set of hidden variables . 
We applied multiscale optimization in all image analysis 
experiments resulting in typical accelerations by factors 
10-100 compared to single-level optimization. 

6 Results 

6.1 Information Retrieval 

Information retrieval in large databases is one of the key 
topics in data mining. The problem is most severe in 
cases where the query cannot be formulated precisely, 
e.g., in natural language interfaces for documents or in 
image databases. Typically, one would like to obtain 
those entries which best match a given query according 
to some similarity measure. Yet, it is often difficult to 
reliably estimate similarities, because the query may not 
contain enough information, e.g., not all possibly rele¬ 
vant keywords might occur in a query for documents. 
Therefore, one often applies the cluster hypothesis [62]: 
if an entry is relevant to a query, similar entries may also 
be relevant to the query although they may not possess 
a high similarity to the query itself due to the small 
number of keywords. Clustering thus provides a way of 
pre-structuring a database for the purpose of improved 
information retrieval, cf. [63] for an overview. Both types 
of clustering approaches, for the set of documents as well 
as for the keywords, have been proposed in the litera¬ 
ture. The most frequently used methods in this context 
are linkage algorithms (single linkage, complete linkage, 
Wards method, cf. [26]), or hybrid combinations of ag- 
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Figure 7: (a) Generalization performance and (b) training likelihood for the annealed EM at different temperatures 
on the Cranfield collection. 
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Figure 8: (a) Training likelihood and (b) generalization performance for the (annealed) predictive EM variant at 
different temperatures on the Cranfield collection. 


glomerative and centroid-based methods [8] which have 
no probabilistic interpretation and have a number of 
other disadvantages. In contrast, COD mixture models 
provide a sound statistical basis and overcome the fun¬ 
damental sparseness problem of proximity-based clus¬ 
tering. In particular the hierarchical clustering model 
(HACM) has many additional features which make it a 
suitable candidate for interactive or coarse-to-fine infor¬ 
mation retrieval. 

We have performed experiments for information re¬ 
trieval on different collections of abstracts. The index 
terms for each dataset have been automatically extracted 
from all documents with the help of a standard word 
stemmer. Following [62], a list of stop words has been 
utilized to exclude frequently used words. Words with 
few overall occurrences have also been eliminated. The 
documents are identified with the set of objects T, while 
index terms correspond to y. 

In the first, series of experiments we have investi¬ 
gated how well the different models perform in pre¬ 
dicting the occurrences of certain words in the con¬ 


text. of a. particular document. Therefore, the set of 
all word occurrences has been divided into a. training 
and a. test set. From a. statistical point of view the 
canonical goodness-of-fit. measure is the average log- 
likelihood on the test set. Yet, in i lie context of nat¬ 
ural language processing it is more customary to utilize 
the perplexity V which is related to the average test 
set log-likelihood / by V = exp(—/). Since we have 
used the annealed EM algorithm, a. validation set was 
utilized to determine the optimal choice of the compu¬ 
tational temperature. Comparative results for all dis¬ 
cussed models 13 on the standard test collection Cra.n- 
field (CRAN, N=1400,M=1664,L=111803) are summa¬ 
rized in Table 2. For all experiments we have performed 
a. ten-fold cross validation. The main conclusions are: 

• The lowest perplexity is obtained with the SMM. 
The HACM performs better than the more con¬ 
strained ACM and SCM. Hence, in terms of per¬ 
plexity the linear mixture models should be pre- 

13 The performance of the PMM is comparable to the SMM 
and hence not. displayed. 
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Figure 9: Upper part, of a cluster hierarchy (6 levels) for the CLUSTER dataset generated by annealed EM for 
HACM. Each node is described by the index words t/j with the highest probability qj\ v . 


ferred over the clustering models. 

• The optimal temperature for the SMM is consis¬ 
tently below T = 1 which is the standard EM algo¬ 
rithm. For the clustering models the optimal gener¬ 
alization performance even requires a much higher 
temperature as expected. 

• Temperature-based complexity control clearly does 
much better than restricting the number K of com¬ 
ponents. Even the SMM with K = 8 components 
suffers from overhtt.ing at T = 1. 

To stress the advantages of the proposed EM variants, 
we have investigated the effect of a temperature-based 
regularization in more detail. Fig. 7 shows log-likelihood 
curves for typical runs of the annealed EM algorithm at 
different temperatures. 14 The overhtt.ing phenomenon is 
clearly visible, e.g., for the SMM at T = 1, where the 
test.—set. performance degrades after 30 iterations. No¬ 
tice, that, annealing performs much better than early 
stopping. A comparison of the predictive EM vari¬ 
ant. with the standard EM for the SMM is depicted in 
Fig. 8. This demonstrates that even the (presumably) 

14 These simulations have been performed on COD from 
the CRAN collection. Qualitatively similar results have been 
obtained for all other data sets. 


slight, modification to avoid positive feedback improves 
the test—set. performance. The overrelaxed EM variant, 
has also proven to be a. valuable tool in our simulations 
with a. typical acceleration by a. factor 2 — 3. 

To facilitate the assessment, of the extracted structure 
we have investigated a. da.t.a.set. of N = 1584 documents 
containing abstracts of papers with clustering as a. ti¬ 
tle word (CLUSTER). This data, is presumably more 
amenable to an interpretation by the reader than are 
the standard text, collections. The t.op-levels of a. cluster 
hierarchy generated by HACM are visualized in Fig. 9. 

To demonstrate the ability of the HACM to identify 
abstraction levels in the hierarchy, wfe' have visualized 
the distribution of the responsibility for observations in¬ 
volving the same index word t/j for some particularly 
interesting examples in Fig. 10. The first, tree for the 
word ‘cluster’ shows that, as expected, the occurrences 
of ‘cluster’ in documents are explained to be a. common 
feature of all documents, hence most, of the occurrences 
are assigned to the root.. The word ‘decision’ is found 
on a. level 3 node, indicating that. it. is a. typical word for 
all algorithmically oriented documents assigned to nodes 
in the subtree, but. e.g. not. for thfeleft. branch of papers 
from physics and astronomy. The index term ‘robust.’ oc¬ 
curs in two different, meanings: first., it. has a. highly spe- 




Figure 10: Exemplary relative word distributions over nodes for the CLUSTER dataset for the keywords ’cluster’, 
’decision’, ’glass’, ’robust’, ’segment’, and ’channel’. 


cific meaning in the context of stability analysis (‘plane’, 
‘perturb’, ‘eigenvalue’, ‘root’, etc.) and a rather broad 
meaning in the sense of robust methods arid algorithms. 
The word ‘segment’ occurs mainly in documents about 
computer vision and language processing, but it is used 
to a significant larger extend in the first field, ‘glass’ is 
a specific term in solid state physics, it thus is found on 
the lowest level of the hierarchy, ‘channel’ is again am¬ 
bivalent, it is used in the context of physics as well as in 
communication theory. The bimodal distribution clearly 
captures this fact. 

The same experiments have been carried out for a 
dataset of 1278 documents with abstracts from the jour¬ 
nals Neural Computation and Neural Netivorks (NN). 
The solution of a HACM with A" = 32 clusters is vi¬ 
sualized in Fig. 11, each node is again described by the 
index words with the highest probability. 

These examples are only spotlights, but they demon¬ 
strate that the hierarchical organization obtained by the 
HACM is able to extract interesting structure from co¬ 
occurrence data. A detailed investigation of its full po¬ 
tential in the context of information retrieval is beyond 
the scope of this paper and will be pursued in future 
work. 

6.2 Computational Linguistics 

In computational linguistics, the statistical analysis of 
word co-occurrences in lexical structures like adjec¬ 
tive/noun or verb/direct object has recently received a 


considerable degree of attention [22, 43, 9, 10]. Potential 
applications of these methods are in word-sense disam¬ 
biguation, a problem which occurs in different linguis¬ 
tic tasks ranging from parsing and tagging to machine 
translation. 

The data we have utilized to test the different mod¬ 
els consists of adjective—noun pairs extracted from a 
tagged version of the Penn Treebank corpus (N = 6931, 
M = 4995, L = 55214) and the LOB corpus {N = 5548, 
M = 6275, L = 36723) 15 . Performance results on the 
Penn dataset are reported in the second half of Table 2. 
The results are qualitatively very similar to the ones ob¬ 
tained on the CRAN document collection, although this 
application is quite different from the one in information 
retrieval. 

A result for a simultaneous hard clustering of the LOB 
data with the SCM is reported in Fig. 12. The visual¬ 
ization of the 7iv„ matrix reveals that many groups in 
either space are preferably combined with mainly one 
group in the complementary space. For example the ad¬ 
jective group ‘holy’, ‘divine’, ‘human’ has its occurrences 
almost exclusively with nouns from the cluster ‘life’, ‘na¬ 
ture’, ‘being’. Some groups are very much indifferent 
with respect to the groups in the corresponding set, e.g., 
the adjective group headed by ‘small’, ‘big’, ‘suitable’. ‘ 


5 Singular and plural forms have been identified. 
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Figure 11: Upper part, of a cluster hierarchy (6 levels) for the NN dataset generated by annealed EM for HACM. 
Each node is described by the index words t/j with the highest probability qj\ v . 


6.3 Unsupervised Texture Segmentation 

The unsupervised segmentation of textured images is one 
of the most challenging and still only partially solved 
problems in low level computer vision. Numerous ap¬ 
proaches to texture segmentation have been proposed 
over the past decades, most of which obey a t.wcMstage 
scheme: 

1. A modeling stage: characteristic features are ex¬ 
tracted from the textured input image, which range 
from spatial frequencies [25, 24], MRF-models 
[33, 34], co-occurrence matrices [16] to fractal in¬ 
dices [6]. 

2. In the clustering stage features are grouped into 
homogeneous segments, wheise homogeneity of fea¬ 
tures has to be formalized by a mathematical no¬ 
tion of similarity. 

Most widely, features are interpreted as vectors in a Eu¬ 
clidean space [25, 33, 34, 65, 40, 6, 31] and a segmenta¬ 
tion is obtained by minimizing the A"-mea.ns criterion, 
which sums over the square distances between feature 
vectors and their assigned, group-specific prototype fea¬ 
ture vectors. A"-mea.ns clustering can be understood as 
a statistical mixture model with isotropic Gaussian class 
distributions. 


Occasionally, the grouping process has been based on 
pairwise similarity measurements between image sites, 
where similarity is measured by a non-parametric sta¬ 
tistical test applied to the feature distribution of a 
surrounding neighborhood [16, 39, 24]. Agglomerative 
techniques [39] and, more rigorously, optimization ap¬ 
proaches [24, 57] have been developed and applied for 
the grouping of similarity data in the texture segmenta¬ 
tion context. Pairwise similarity clustering thus provides 
an indirect way to group (discrete) feature distributions 
without reducing information in a distribution to their 
mean. 

COD mixture models, especially the ACM model, for¬ 
malize the grouping of feature distribution in a more di¬ 
rect manner. In contrast to pairwise similarity cluster¬ 
ing, they offer a sound generative model for texture class 
description which can be utilized in subsequent process¬ 
ing stages like edge localization [56]. Furthermore, there 
is no need to compute a large matrix of pairwise simi¬ 
larity scores between image sites, which greatly reduces 
the overall processing time and memory requirements. 
Compared to the mixture of Gaussian model, ACM pro¬ 
vides significantly more flexibility in distribution model¬ 
ing. Especially in the texture segmentation application 
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Figure 12: Clustering of LOB using the SCM (K* = = 32) with a visualization of the tt„^ matrix and a 

characterization of clusters by their most probable words. 


modal distribution, which is the main reason for the suc¬ 
cess of pairwise similarity clustering approaches and the 
ACM compared to standard A’-means. 

The ACM is applicable to any feature extraction pro¬ 
cess. As an example we exploit a Gabor filter image 
representation. More specifically, in all experiments we 
used the modulus of a bank of Gabor filters with 4 ori¬ 
entations at 3 scales with an octave spacing, resulting in 
a 12 dimensional feature vector associated to each image 
site. In our experiments the features were discretized 
Separately for each dimension using 40 bins. Statistical 
independence of the feature channels has been assumed 
for simplicity. 

The feature generation process is then modeled as a 
co-occurrence of an image site Xi E X and a measured 
Gabor feature occurrence in one channel yj = f k 3 E T, 
where rj denotes the Gabor channel and kj denotes the 
index of the (discretized) feature. The sample set Si for 
site Xi consists of all Gabor responses in a window cen¬ 
tered at Xi, where the size of the window is chosen pro¬ 
portional to the filter scale [25]. Hence each image loca¬ 
tion Xi is effectively characterized by 12 one-dimensional 
histograms over Gabor coefficients. 

We have applied the ACM-based texture segmenta¬ 
tion algorithm to a collection of textured images. Fig. 13 
shows exemplary results for images which were ran¬ 
domly generated from the Brodatz texture collection of 
micro-textures. Fig. 14 shows similar results for mix¬ 
tures of aerial images. A detailed benchmark study of 
this novel segmentation algorithm including comparisons 
with state-of-the-art techniques will appear in a forth¬ 
coming paper. 


7 Conclusion 

As the main contribution of this paper a novel class of 
statistical models for the analysis of co-occurrence data 
has been proposed and evaluated. We have introduced 
and discussed several different models, not only by enu¬ 
merating them as alternative approaches, but by distin¬ 
guishing them from a systematic point of view. The 
criterion to differentiate between these models is the 
way hidden variables are introduced, which effectively 
imposes constraints on the component distributions of 
the mixture. Several recently proposed statistical mod¬ 
els have turned out to be special cases. All models have 
a sound statistical foundation in that they define a gen¬ 
erative distribution, and all of them can be fitted by an 
(approximate) EM algorithm. 

Which of these models is the method of choice for a 
given problem crucially depends on the modeling goal. 
As we have argued, it is often required to detect groups 
structure or hierarchical representations. In these situ¬ 
ations one may be willing to sacrifice some precision in 
terms of statistical accuracy (i.e., perplexity reduction) 
to extract the structure of interest. Within the proposed 
framework models have been derived to extract group 
structure on either one or both object spaces and to 
model hierarchical dependencies of clusters. We strongly 
believe the proposed framework is flexible enough to be 
adapted to many different tasks. The generality of the 
developed methods has been stressed by discussing their 
benefits in the context of a broad range of potential ap¬ 
plications. 

In addition to the modeling problem we have also ad¬ 
dressed computational issues, in particular focusing on 
improved variants of the basic EM algorithm. Most irn- 
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Figure 13: (a), (c) Two mixture images containing 5 textures each, (b), (d) The image segmentation obtained based 
on the ACM. 


portantly, our experiments underline the possible advan¬ 
tages of the annealed version of EM, which is a fruitful 
combination of ideas and methods from statistics and 
statistical physics. 
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Appendix A 

First, we establish an important relationship between 
the log-likelihood and a quantity known as free en¬ 
ergy in statistical physics [38]. Consider the data, log- 
likeliliood C = logP(J>|0, R) as a function of the discrete 
hidden states R over 7v for fixed parameters, and let 
H(R;S,9) = —C define a cost, function on the hidden 
variable space. Minimizing E \H(R; J>, 0)] w.r.t. prob¬ 
ability distributions over 7v subject to a constraint on 
the entropy yields a quantity which is known as the free 
energy in statistical physics. This is generalized to non- 
uniform priors by fixing the relative entropy with respect 
to a prior distribution 7r(J?,). Introducing a Lagrange pa¬ 
rameter T we arrive at the following objective function 
for probability distributions over the discrete space 7v 


T t {P\S, 9, 7T) 


E P [H(I)\ + TEp 


log 


P(RY 

tt{R)_ ' 


(64) 


The solution of the minimization problem associated 
with the generalized free energy in (64) is the (tilted) 
Gibbs distribution 


P(R\S, 9, 7r) esc 


7r( J?) exp 

n(R)[P(S\9,R)] 


--H(R;S,9) 


(65) 


For T = 1 this is exactly the posterior probability of 
R. The posterior thus minimizes Tx at T = 1. The 
annealed EM algorithm is the generalization defined by 
an arbitrary choice of (the temperature) T [60]. In the 


E-step for T > 1 this amounts to discounting the likeli¬ 
hood as compared to the prior by taking it to the 1 /T- 
th power. The M-step performs a minimization over 
Ep [H{I; S, 0)] and therefore Tx{P |<S, 9 , ir) with respect 
to 9 for fixed P where P is a Gibbs distribution which 
does not necessarily correspond to the true posterior. 
Notice that convergence of annealed EM is guaranteed 
since Tx (but not necessarily the likelihood itself) is a 
Lyapunov function. 

If an exact, calculation of the posterior in the E-step 
is intractable, typically because of higher order correla¬ 
tions between the hidden variables as in the SCM, the 
optimization problem in (64) is restricted to factorial dis¬ 
tributions P(R\p ra ) = Ilr Fla Pra“ ■ We make use of the 
more suggestive notation (R ra ) = Pra to stress that the 
variational parameters p ra can actually be thought of as 
an approximation of the posterior marginals. This vari¬ 
ational technique is known as mean-field approximation 
[42, 2] and has been successfully applied for optimization 
problems [61, 23], in computer vision [15, 67, 24], and for 
inference in graphical models [55]. 

In general, solutions of the mean-field approximation 
have to fulfill the stationary conditions 


{Pro/} — / Tffa eXp 


-~{'K[R\S,9,R ra = 1)) 


( 66 ) 


where expectations are taken with respect to P(R\p ra ) 
[24]. Notice that expected costs appear in the exponent, 
however the expectation is taken with respect to all hid¬ 
den variables except R ra itself which is fixed. In order 
to obtain a convergent iteration procedure in the general 
case one has to replace the ‘synchronous’ E-^step update 
by a ‘sequential’ update. In the SCM, the coupling be¬ 
tween hidden variables is restricted to pairs of variables 
Ii v and Jjp with m.j > 0 which allows us to recompute 
all posteriors for the I variables for given posteriors J in 
one sweep, and vice versa. 


Appendix B 

In order to preserve strict optimization principles, the 
predictive E-step variant has to be implemented slightly 
more careful than naively eliminating diagonal contribu¬ 
tions. Inserting the corrected estimates of 9 for given 
R into the complete data log-likelihood function (2) we 
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Figure 14: (a), (c) Two mixture images each containing 
7 textures extracted from aerial images, (b), (d) The 
image segmentation obtained based on the ACM. 

obtain the following cost, function for R 
L K 

7i(R; S) = EE ■R’V 01 h f' q/ (^67^) 

r —1 a =1 

h r a — log ' Rua T log ' Rua log ^ ' Rua , 

v u: t r uj^r 

>(«) = . (r) i(*)=i(r) 

which we refer to as predictive likelihood. Minimizing 
the free energy corresponding to (67) in a mean-held 
approximation yields a direct contribution proportional 
to exp \—h r a/T\ but also additional terms which result 
from the indirect effect R ra has on h sa for other vari¬ 
ables R sa . We omit the details of the derivation which 
is purely technical. As a consequence one has to utilize 
sequential update to guarantee convergence of predictive 
EM. For reasons of efficiency we have ignored the indirect 
effects in the computation of the posterior probabilities 
in our experiments which empirically turned out to work 
well. 
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